A decoupled approach to high-level loop optimization : tile shapes, polyhedral building blocks and low-level compilers. (Une approche découplée pour l'optimization de boucle à haut niveau)

Author

  • Tobias Grosser
Abstract

Despite decades of research on high-level loop optimizations and their successful integration in production C/C++/FORTRAN compilers, most compiler-internal loop transformation systems only partially address the challenges posed by the increased complexity and diversity of today’s hardware. Especially when exploiting domain-specific knowledge to obtain optimal code for complex targets such as accelerators or many-core processors, many existing loop optimization frameworks have difficulties exploiting this hardware. As a result, new domain-specific optimization schemes are developed independently without taking advantage of existing loop optimization technology. This results in both missed optimization opportunities and low portability of these optimization schemes to different compilers. One area where we see the need for better optimizations is iterative stencil computations, an important computational problem that is regularly optimized by specialized, domain-specific compilers, but where generating efficient code is difficult. In this work we present new domain-specific optimization strategies that enable the generation of high-performance GPU code for stencil computations. Unlike most existing domain-specific compilers, we decouple the high-level optimization strategy from the low-level optimization and specialization necessary to yield optimal performance. As a high-level optimization scheme we present a new formulation of split tiling, a tiling technique that ensures reuse along the time dimension as well as balanced coarse-grained parallelism without the need for redundant computations. Using split tiling we show how to integrate a domain-specific optimization into a general-purpose C-to-CUDA translator, an approach that allows us to reuse existing non-domain-specific optimizations.
We then evolve split tiling into a hybrid hexagonal/parallelogram tiling scheme that allows us to generate code that addresses GPU-specific concerns even better. To conclude our work on tiling schemes we investigate the relation between diamond and hexagonal tiling. Starting with a detailed analysis of diamond tiling, including the requirements it poses on tile sizes and wavefront coefficients, we provide a unified formulation of hexagonal and diamond tiling which enables us to perform hexagonal tiling for two-dimensional problems (one time, one space) in the context of a general-purpose optimizer such as Pluto. Finally, we use this formulation to evaluate hexagonal and diamond tiling in terms of compute-to-communication and compute-to-synchronization ratios. In the second part of this work, we discuss our contributions to important infrastructure components, our building blocks, that en-


Similar resources

Analyse de Programmes Malveillants par Abstraction de Comportements. (Analysis of Malware by Behavior Abstraction)

ion de Comportements par Réécriture de Mots. We saw in Section 1.2 that classical behavioral analysis operates directly at the level of observed interactions (library calls, system calls...), which makes the detection of suspicious behaviors fragile, since the slightest change in how a functionality is implemented can cause detection to fail...


Model Transformations from a Data Parallel Formalism towards Synchronous Languages

The increasing complexity of embedded system designs calls for high-level specification formalisms and for automated transformations towards lower-level descriptions. In this report, a metamodel and a transformation chain are defined from a high-level modeling framework, Gaspard, for data-parallel systems towards a formalism of synchronous equations. These equations are translated in synchronou...


Comparaison de BTD avec des stratégies d’exploration “intelligentes” pour une sélection automatique d’algorithmes

We consider a generic solver for binary constraint satisfaction problems (CSPs), parameterized by high-level choices, namely the type of search, the level of constraint propagation, and the variable selection heuristic. We experimentally compare 18 configurations of this generic solver on more than a thousand instances. A first goal is to understand the co...


Optimizing DDR-SDRAM Communications at C-level for Automatically-Generated Hardware Accelerators

High-level synthesis tools are becoming more mature at generating hardware accelerators with an optimized internal structure, thanks to efficient scheduling techniques, resource sharing, and finite-state machine generation. However, interfacing them with the outside world, i.e., integrating the automatically-generated hardware accelerators within the complete design, with optimized communi...


Advances in Bit Width Selection Methodology

We describe a method for the formal determination of signal bit width in fixed-point VLSI implementations of signal processing algorithms containing loop nests. The main advance of this paper lies in the fact that we use results of the (max,+) algebraic theory to find the integral bit width of algorithms containing loop nests whose bound parameters are not statically known. Combined with recen...



Publication date: 2014